One Y Variable

Module 2.3: Distributions

Alex Cardazzi

Old Dominion University

Distributions

So far, we have characterized certain “moments” of data (i.e., mean and variance).

While these two numbers are helpful, they do not tell the full story. As an example, let’s consider the following data:

Code
df <- read.csv("https://vincentarelbundock.github.io/Rdatasets/csv/datasets/anscombe.csv")
cat("Summary Statistics for x1. Mean:", mean(df$x1), "; Variance:", var(df$x1), "\n")
cat("Summary Statistics for x4. Mean:", mean(df$x4), "; Variance:", var(df$x4))
Output
Summary Statistics for x1. Mean: 9 ; Variance: 11 
Summary Statistics for x4. Mean: 9 ; Variance: 11

Now, check out the underlying data:

Code
df[,c("x1", "x4")]
Output
   x1 x4
1  10  8
2   8  8
3  13  8
4   9  8
5  11  8
6  14  8
7   6  8
8   4 19
9  12  8
10  7  8
11  5  8

Distributions

Clearly, even though these data have the same mean and variance, they have very different distributions. This is different from the case of Michael Jordan and UNC. There, the distributions were in fact very similar except for Michael Jordan’s outlier salary.
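To see the difference for yourself, here is a small sketch with the two columns hard-coded from the output above, so it runs without re-downloading the data:

```r
# Anscombe's x1 and x4, hard-coded from the table above so this runs offline
x1 <- c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5)
x4 <- c(8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8)
c(mean(x1), var(x1))  # 9 and 11
c(mean(x4), var(x4))  # also 9 and 11
table(x4)             # but x4 is ten 8s and a single 19
```

The moments match exactly, yet `table()` immediately reveals that x4 is a single outlier sitting on a pile of identical values.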

Before we can keep talking about distributions, we need to discuss probability briefly.

Probability

Probability

A value between 0 and 1 inclusive that represents the likelihood a particular event happens.

The probability that a coin is flipped and it lands on tails is \(\frac{1}{2}\) or \(0.5\) or \(50\%\). What about the probabilities and outcomes of flipping two coins and counting the number of tails? Well, since the number of tails is a random variable, it has a distribution. That distribution is defined by all possible outcomes and their associated probabilities:

Probability Distribution of Two Coin Flips

Number of Tails, \(y\)   Probability, \(P(y)\)
0                        0.25
1                        0.50
2                        0.25
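Because the number of tails in two fair flips follows a binomial distribution, the table above can be reproduced with R’s dbinom():

```r
# Probability of 0, 1, or 2 tails in two fair coin flips
y <- 0:2
p <- dbinom(y, size = 2, prob = 0.5)
data.frame(tails = y, probability = p)
# probabilities are 0.25, 0.50, 0.25
```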

Probability

All distributions have a mean and a variance. So, how do we find the mean and variance of a probability distribution?

\[\mu = \sum_{i = 1}^g P(y_i) \times y_i\]

\[\sigma^2 = \sum_{i = 1}^g P(y_i) \times (y_i - \mu)^2\]

Note: this should look a lot like the formula for weighted average.

Probability

To practice, let’s use the previous distribution of flipping two coins.

\[\mu = \sum_{i = 1}^g P(y_i) \times y_i = (0.25\times0) + (0.5\times1) + (0.25\times2) = 1\]

\[\begin{aligned}\sigma^2 &= \sum_{i = 1}^g P(y_i) \times (y_i - \mu)^2 \\ &= (0.25\times(0-1)^2) + (0.5\times(1-1)^2) + (0.25\times(2-1)^2) \\ &= (0.25\times(-1)^2) + (0.5\times(0)^2) + (0.25\times(1)^2) \\ &= 0.25 + 0 + 0.25 = 0.5\end{aligned}\]
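These hand calculations can be verified in R as weighted sums:

```r
y <- 0:2                       # possible numbers of tails
p <- c(0.25, 0.50, 0.25)       # their probabilities
mu <- sum(p * y)               # mean of the distribution: 1
sigma2 <- sum(p * (y - mu)^2)  # variance of the distribution: 0.5
c(mu, sigma2)
```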

Probability

Flipping coins is an example of a discrete distribution. In other words, individual events have non-zero probabilities of occurring. However, when distributions are continuous, the probability of any specific outcome occurring is precisely zero. What? How can that be true?

Consider the lifespan of a tire on a car. The average tire lasts about 54,000 miles. However, what is the probability that your tire lasts exactly 54,000 miles? Essentially 0%.

This is because your tire could last 54,000.1 miles, 54,000.01 miles, and so on. There are an infinite number of miles that your tire could last, so the probability of any specific mile being the one where your tire blows is zero.

Probability

When dealing with continuous probabilities, instead of asking for \(P(y = M)\) (e.g., the probability that your tire lasts exactly \(M\) miles), people ask for \(P(y < M)\), \(P(y > m)\), or \(P(m < y < M)\) (e.g., the probability that your tire lasts more or less than some number of miles, or the probability that your tire lasts between two numbers of miles).

In the interest of being concrete, suppose tires last anywhere between 40,000 and 68,000 miles. The distribution is uniform, meaning that the probability of your tire blowing out is the same at every mileage from 40,000 to 68,000. A question from a statistics class might be: “what is the probability that your tire lasts more than 60,000 miles?”

To answer this, you need to know what fraction of tires last beyond 60,000 miles. Since all tires blow out between 40,000 and 68,000 miles, the baseline range is 68,000 - 40,000 = 28,000 miles. The range we’re interested in is 68,000 - 60,000 = 8,000 miles. So, the probability would be \(\frac{8}{28} \approx 0.286\).
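R’s punif() gives this uniform probability directly; a quick check using the 40,000 to 68,000 mile range assumed above:

```r
# P(tire lasts more than 60,000 miles) when lifespans are Uniform(40000, 68000)
punif(60000, min = 40000, max = 68000, lower.tail = FALSE)
# about 0.286, i.e., 8/28
```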

The Normal Distribution

The above math highlights how you would deal with probabilities and continuous distributions. However, not all continuous distributions are “uniform” like we assumed above. Perhaps the most famous and widely studied distribution is the normal distribution.

The normal distribution:

  • is bell-shaped with its peak in the center
  • is symmetric, meaning the mean is the same as the median
  • asymptotically approaches the x-axis (meaning it gets close but never touches it!)
  • is completely characterized by its mean and standard deviation, and is written as \(N(\mu, \sigma)\)

The Normal Distribution

Below is an example of what a normal distribution looks like:

Plot

Example of a normal distribution

Example of a Normal Distribution

Multiple Normal Distributions

Normal distributions take on different shapes depending on their means and standard deviations.

  • An increase in the mean, while holding the standard deviation constant, slides the distribution horizontally to the right along the x-axis.
  • An increase in the standard deviation, while holding the mean constant, results in a shorter, wider distribution.

Below are some examples of what these changes would do to a standard normal distribution:

Plot

Examples of multiple normal distributions with different means and variances.

Example of Multiple Normal Distributions

Normal Distribution Application

To learn the most incredible feature of the normal distribution, you’ll have to wait for later in the module. However, while we wait, let’s discuss how to compare two different normal distributions.

Two particularly famous distributions that are normally distributed are scores on the SAT and ACT. The maximum score for each is 1600 and 36, respectively. The SAT has an average score of 1050 with a standard deviation of 200. The ACT has an average score of 19 with a standard deviation of 6.

So, let’s say someone scores a 1,300 on the SAT and a 29 on the ACT. Which test did they do better on?

Normal Distribution Application

Well, let’s first visualize the two distributions and the two scores.

Plot

SAT and ACT distributions.

Normal Distribution Application

Well, this visualization is not helpful. Since the two distributions have different means and standard deviations, we cannot really compare them. A first step to getting a better comparison would be to subtract off the mean from both distributions. Then, both distributions will have a mean of zero.

Plot

SAT and ACT distributions centered at 0.

Normal Distribution Application

Still, this is not helpful. Now, the different variances are distorting the comparison. To fix this, we can divide each by its respective standard deviation. This way, both distributions have a mean of zero and a standard deviation (and variance) of 1.

Plot

SAT and ACT distributions centered at zero and scaled such that their standard deviations are both 1.

Normal Distribution Application

Now, it should be quite obvious which score was “better”. The SAT score has a standardized value of 1.25 while the ACT score has a standardized value of about 1.67.

What do these numbers mean? In words, this is the number of standard deviations away from the mean the original score is. This actually has a name, and it is a \(z\)-score. The formula for a \(z\)-score is:

\[z = \frac{y - \mu}{\sigma}\]
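Plugging the two test scores into this formula:

```r
z_sat <- (1300 - 1050) / 200  # 1.25 standard deviations above the mean
z_act <- (29 - 19) / 6        # about 1.67 standard deviations above the mean
c(z_sat, z_act)
```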

Normal Distribution Application

\(z\)-scores come from what is called a standard normal distribution, which is just a normal distribution with a mean of 0 and standard deviation (and variance) of 1, or \(N(0, 1)\). We know a lot about the standard normal distribution. For example, we know the probability that a random draw from a normal distribution is greater than some number, less than some number, or between two numbers. In most statistics courses, this is where you’d be shown a \(z\)-score table. Rather than spending time on this table, here are some takeaways:

  • Since the distribution is symmetric, the probability of a random draw being larger (or smaller) than 0 is \(\frac{1}{2}\).
  • There is a 15.87% chance that a random draw is larger (smaller) than +1 (-1).
  • There is a 2.28% chance that a random draw is larger (smaller) than +2 (-2).
    • This means that there is a 95.44% chance that the random draw is between -2 and +2.

Normal Distributions in R

R has some pre-built functions to work with normal distributions.

  • pnorm() gives \(P(Z < y)\) when lower.tail = TRUE. When lower.tail = FALSE, this gives \(P(Z > y)\)
    • pnorm(0, lower.tail = TRUE): 0.5
    • pnorm(2, lower.tail = FALSE): 0.0227501.
  • qnorm() is the opposite of pnorm() in that it converts probabilities into \(z\)-scores.
  • rnorm() generates random draws from a normal distribution with a given mean and standard deviation. The default mean is 0 and the default standard deviation is 1.
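A quick illustration of all three functions:

```r
pnorm(0)                     # P(Z < 0) = 0.5
pnorm(2, lower.tail = FALSE) # P(Z > 2), about 0.0228
qnorm(0.5)                   # the z-score with half the distribution below it: 0
set.seed(1)                  # make the random draws reproducible
draws <- rnorm(1000)         # 1000 draws from N(0, 1)
mean(draws)                  # close to 0 in a sample this large
```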

Arbitrary Distributions

What about arbitrary distributions that we observe in data? We know how to calculate the mean and variance of a random variable, but it’s helpful to visualize the distribution.

If your data is sufficiently discrete, I prefer to use table(). Of course, “sufficiently discrete” is difficult to define. If your data has many unique values, you may need to transform the data via rounding before using table(). Otherwise, you can use hist() to generate a histogram (which is effectively what table() is doing) or density() to create a kernel density plot. You can think of kernel density plots as smoothed out histograms.
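As a sketch with simulated (not real) data, here is how the three approaches compare:

```r
set.seed(42)                       # simulated data for illustration only
x <- rnorm(500, mean = 10, sd = 2)
table(round(x))                    # counts after rounding to whole numbers
hist(x, main = "Histogram")        # binned frequencies
plot(density(x), main = "Kernel Density")  # a smoothed histogram
```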

Arbitrary Distributions

Let’s take a look at some real estate transaction data as an example.

Code
# https://vincentarelbundock.github.io/Rdatasets/doc/AER/HousePrices.html
df <- read.csv("https://vincentarelbundock.github.io/Rdatasets/csv/AER/HousePrices.csv")
# Keep only columns 2 through 6.
# These columns are the ones that contain numeric data.
# The step below subsets those columns,
#   then saves the result over the original data frame.
df <- df[,2:6]
head(df)
Output
  price lotsize bedrooms bathrooms stories
1 42000    5850        3         1       2
2 38500    4000        2         1       1
3 49500    3060        3         1       1
4 60500    6650        3         1       2
5 61000    6360        2         1       1
6 66000    4160        3         1       1

Arbitrary Distributions

First, let’s examine the distribution of bedrooms.

Code
plot(table(df$bedrooms),
     xlab = "Number of Bedrooms",
     ylab = "Frequency")
Plot

Distribution of number of bedrooms in housing transaction data.

Arbitrary Distributions

Next, let’s try to plot the distribution of sale price.

Code
plot(table(df$price), xlab = "Sale Price", ylab = "Frequency")
Plot

Distribution of sale price in housing transaction data.

This is not a very informative plot as there’s no discernible pattern. Rather, let’s divide by 10,000, round the data, and re-multiply by 10,000.

Code
plot(table(round(df$price/10000)*10000), xlab = "Sale Price", ylab = "Frequency")
Plot

Distribution of sale price in housing transaction data.

This is a much better plot. We can see that there are some outliers over $150,000, and the modal sale price is about $60,000.

Arbitrary Distributions

Finally, we want a way to tabulate multiple summary statistics for multiple variables. To do this, we can use R’s summary() function.

Code
summary(df)
Output
     price           lotsize         bedrooms       bathrooms    
 Min.   : 25000   Min.   : 1650   Min.   :1.000   Min.   :1.000  
 1st Qu.: 49125   1st Qu.: 3600   1st Qu.:2.000   1st Qu.:1.000  
 Median : 62000   Median : 4600   Median :3.000   Median :1.000  
 Mean   : 68122   Mean   : 5150   Mean   :2.965   Mean   :1.286  
 3rd Qu.: 82000   3rd Qu.: 6360   3rd Qu.:3.000   3rd Qu.:2.000  
 Max.   :190000   Max.   :16200   Max.   :6.000   Max.   :4.000  
    stories     
 Min.   :1.000  
 1st Qu.:1.000  
 Median :2.000  
 Mean   :1.808  
 3rd Qu.:2.000  
 Max.   :4.000  

Arbitrary Distributions

However, summary() isn’t our best option for creating a table. We are going to use our first R package for this task. Using an R package takes two steps. First, you will need to install the package (download the package from the internet onto your computer), and then call the package (loading the package into the current R session). You only need to install the package once, but you need to call the package every time you use it.

Below, I am installing the modelsummary package and then loading it into our session. You can find the documentation for modelsummary here. We also need to install kableExtra, but we won’t use it until later.

Code
# only use install.packages() once per package
install.packages("kableExtra")
install.packages("modelsummary")
# use this every time you want to use 'modelsummary'
library("modelsummary")

Arbitrary Distributions

This package contains a few different functions, but we are going to focus on two: datasummary() and datasummary_skim(). You can find the documentation for datasummary here. To start, let’s see the output of datasummary_skim().

Code
datasummary_skim(df)
          Unique (#) Missing (%)     Mean      SD     Min  Median      Max
price            219           0  68121.6 26702.7 25000.0 62000.0 190000.0
lotsize          284           0   5150.3  2168.2  1650.0  4600.0  16200.0
bedrooms           6           0      3.0     0.7     1.0     3.0      6.0
bathrooms          4           0      1.3     0.5     1.0     1.0      4.0
stories            4           0      1.8     0.9     1.0     2.0      4.0

Arbitrary Distributions

While this provides us with a lot of useful information, it lacks customization. On the other hand, datasummary() provides us with much more versatility. To start, take a look at the below example. We are summarizing price, lotsize, and bedrooms from the dataset df. For each variable, we are going to calculate length (sample size), Mean, SD, Min, and Max.

Code
datasummary(price + lotsize + bedrooms ~ length + Mean + SD + Min + Max, data = df)
         length     Mean       SD      Min       Max
price    546.00 68121.60 26702.67 25000.00 190000.00
lotsize  546.00  5150.27  2168.16  1650.00  16200.00
bedrooms 546.00     2.97     0.74     1.00      6.00

Arbitrary Distributions

To further complicate this, we can include many other options. However, for simplicity, we are going to focus on the most important ones. First, we are going to give the table a title. Second, we are going to rename the variables. Take a look:

Code
title <- "This is the Table's Caption/Title"
frmla <- (`Price` = price) + (`Lot Size` = lotsize) + (`Bedrooms` = bedrooms) ~
  (`N` = length) + Mean + (`St. Dev.` = SD) + Min + Max
datasummary(frmla, data = df, title = title)
This is the Table's Caption/Title
              N     Mean St. Dev.      Min       Max
Price    546.00 68121.60 26702.67 25000.00 190000.00
Lot Size 546.00  5150.27  2168.16  1650.00  16200.00
Bedrooms 546.00     2.97     0.74     1.00      6.00